3.2.4 Exercises

1. Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

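An empty plot. ggplot(data = mpg) creates a blank canvas, but no geom layers or aesthetic mappings have been added, so nothing is drawn.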

2. How many rows are in mpg? How many columns?

nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11

There are 234 rows and 11 columns.
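As a small sketch, both numbers can also be read off in a single call with dim(), which returns the row count followed by the column count. The variable name mpg_dims is only illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# dim() returns c(number of rows, number of columns)
mpg_dims <- dim(mpg)
mpg_dims
```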

3. What does the drv variable describe? Read the help for ?mpg to find out.

The drv variable describes which wheels of the car receive power from the engine (f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive).

4. Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cyl, y = hwy))

5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))

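The plot is not useful because both variables are categorical, so the points can only fall on a small grid of class and drv combinations. Many cars are drawn on top of one another, and the plot shows only which combinations exist, not how many cars have each one.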

3.3.1 Exercises

1. What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

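The points are not blue because color = "blue" is inside aes(), where it is treated as a mapping to a variable that has the single value "blue". ggplot2 then picks a color for that value from its default scale and adds a legend. To set the color manually, color = "blue" needs to be outside of aes(), as an argument to geom_point().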

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
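The difference can be checked without looking at the plots. This is a minimal sketch that uses layer_data() from ggplot2 to inspect the colors each layer ends up with. The object names (p_mapped, p_set, and so on) are illustrative.

```r
library(ggplot2)

# color inside aes(): "blue" is treated as data and mapped through a scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# color outside aes(): "blue" is used directly as the point color
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the data each layer actually draws
colour_mapped <- unique(layer_data(p_mapped)$colour)
colour_set <- unique(layer_data(p_set)$colour)

colour_mapped  # a single color chosen by the default scale, not "blue"
colour_set
```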

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

manufacturer, model, trans, drv, fl, and class are categorical, while displ, year, cyl, cty, and hwy are continuous. You can see this information by running mpg and looking at the variable types under the column headings (chr indicates categorical, while dbl or int indicates continuous).

mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4        1.80  1999     4 auto(l… f        18    29 p    
##  2 audi         a4        1.80  1999     4 manual… f        21    29 p    
##  3 audi         a4        2.00  2008     4 manual… f        20    31 p    
##  4 audi         a4        2.00  2008     4 auto(a… f        21    30 p    
##  5 audi         a4        2.80  1999     6 auto(l… f        16    26 p    
##  6 audi         a4        2.80  1999     6 manual… f        18    26 p    
##  7 audi         a4        3.10  2008     6 auto(a… f        18    27 p    
##  8 audi         a4 quat…  1.80  1999     4 manual… 4        18    26 p    
##  9 audi         a4 quat…  1.80  1999     4 auto(l… 4        16    25 p    
## 10 audi         a4 quat…  2.00  2008     4 manual… 4        20    28 p    
## # ... with 224 more rows, and 1 more variable: class <chr>
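The same classification can be sketched programmatically from the column classes. This assumes, as in the answer above, that character columns are the categorical ones and integer or double columns are the continuous ones. The helper names are illustrative.

```r
library(ggplot2)

# First class of every column, e.g. "character", "integer", "numeric"
col_types <- vapply(mpg, function(col) class(col)[1], character(1))

# Assumption: character columns are categorical, numeric ones continuous
categorical_vars <- names(col_types)[col_types == "character"]
continuous_vars <- names(col_types)[col_types %in% c("integer", "numeric")]

categorical_vars
continuous_vars
```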

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

color

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = displ))

size

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = displ))

shape

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = displ))
## Error: A continuous variable can not be mapped to shape

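A continuous variable mapped to color gets a gradient, and one mapped to size gets a continuous range of point sizes, but it cannot be mapped to shape (hence the error above). Categorical variables are separated into discrete groups, each with its own color, size, or shape (as shown below for the color aesthetic).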

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

4. What happens if you map the same variable to multiple aesthetics?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = displ, size = displ))

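Each aesthetic is mapped to that variable, so the same information is encoded twice: here larger displacements get both a lighter color and a larger point. The mapping is redundant, and ggplot2 draws a legend for each aesthetic.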

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), stroke = 1, shape = 21)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), stroke = 5, shape = 21)

The stroke aesthetic sets the border width of shapes that have a border, such as the filled shapes 21 to 25 (shape 21 is used above).

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

ggplot2 evaluates the expression for every row and maps the resulting logical value to the aesthetic, as if it were a temporary variable. In this case, points with an engine displacement less than 5 get the color for TRUE, points with a displacement of 5 or more get the color for FALSE, and the legend is titled displ < 5.
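To illustrate, the expression can be stored as an explicit column first. The sketch below assumes a hypothetical column name, small_engine, and compares the colors of the two plots with layer_data().

```r
library(ggplot2)

# Store the logical expression as a column (small_engine is a made-up name)
mpg_flagged <- mpg
mpg_flagged$small_engine <- mpg_flagged$displ < 5

p_expr <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

p_col <- ggplot(data = mpg_flagged) +
  geom_point(mapping = aes(x = displ, y = hwy, color = small_engine))

# Both plots assign the same color to every point
same_colours <- identical(layer_data(p_expr)$colour, layer_data(p_col)$colour)
same_colours
```

Only the legend title differs between the two plots.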

3.5.1 Exercises

1. What happens if you facet on a continuous variable?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty, nrow = 3)

There will be a subplot displayed for each unique value of the continuous variable (the number of subplots displayed can potentially be very large).
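One possible way to keep the number of panels manageable is to bin the continuous variable before faceting. The sketch below uses cut_width() from ggplot2 with a bin width of 5, which is an arbitrary choice made for illustration.

```r
library(ggplot2)

# One panel per unique value of cty
n_panels <- length(unique(mpg$cty))

# Binning cty into intervals of width 5 gives far fewer panels
n_binned <- length(unique(cut_width(mpg$cty, 5)))

p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(cty, 5))

c(unbinned = n_panels, binned = n_binned)
```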

2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl)) 

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl)) + 
  facet_grid(drv ~ cyl)

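The empty cells indicate that there are no observations with that particular combination of drv and cyl (for example, there are no cars with 4 cylinders that have rear-wheel drive). They correspond to the positions in the unfaceted drv vs. cyl scatterplot where no point is drawn.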
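A cross-tabulation gives the same information as numbers. In this sketch, each zero in the table corresponds to an empty cell in the faceted plot. The names combos and empty_cells are illustrative.

```r
library(ggplot2)

# Count the cars for every combination of drv and cyl
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Combinations with a count of zero are the empty facets
empty_cells <- sum(combos == 0)
empty_cells
```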

3. What plots does the following code make? What does . do?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

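. is a placeholder for "no variable" on that side of the formula, so the plot is faceted in a single dimension. facet_grid(drv ~ .) gives an N x 1 grid (one row per value of drv), while facet_grid(. ~ cyl) gives a 1 x N grid (one column per value of cyl), where N is the number of unique values of the faceting variable.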

4. Take the first faceted plot in this section. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

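Advantages to using faceting instead of the colour aesthetic: each class gets its own panel, so patterns within a class are easy to see, and the eye does not have to tell seven similar colours apart.

Disadvantages: the points for different classes are no longer on the same panel, so comparing classes and seeing the global trend is harder.

With a larger dataset, the balance shifts toward faceting: points of different colours increasingly overlap, and it becomes difficult to distinguish the colours.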

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow sets the number of rows in the faceted plot, while ncol sets the number of columns. dir and as.table also control the layout of the individual panels: dir determines whether the panels are filled in horizontally or vertically, while as.table determines whether the highest-value facets are placed at the bottom-right (the default) or at the top-right. facet_grid() has no nrow or ncol arguments because the number of rows and columns is fixed by the faceting formula: the number of unique values of the first variable gives the number of rows, and that of the second gives the number of columns.

6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Putting the variable with more unique levels in the columns makes the grid grow horizontally instead of vertically. Plots are usually displayed in landscape orientation, where there is more horizontal than vertical space, so this keeps the individual panels from being squashed in the vertical dimension.

3.6.1 Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

To draw a line chart, use geom_line(); a boxplot, geom_boxplot(); a histogram, geom_histogram(); and an area chart, geom_area().

2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

The output will be a scatterplot with engine displacement on the x axis and highway miles per gallon on the y axis (negative correlation). Both the points and the smooth lines will be colored based on whether the car is front-wheel drive, rear wheel drive, or four wheel drive.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

show.legend = FALSE prevents the legend from being displayed. If you remove it, the legend is drawn whenever the layer maps an aesthetic such as color. It was used earlier in the chapter to ensure all three plots in the example had the same format.

4. What does the se argument to geom_smooth() do?

If se = TRUE (the default) then there is a confidence interval drawn around the smooth line. If se = FALSE then there is no confidence interval drawn.

5. Will these two graphs look different? Why/why not?

These graphs will look exactly the same. They use the same dataset and the same mappings. The only difference is that in the first code block the data and mappings are global, so they apply to every geom in the graph, while in the second code block they are specified locally for each layer.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
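The equivalence can also be checked without comparing the plots by eye. The following sketch builds both plots and compares the data each layer draws using layer_data(). The object names are illustrative.

```r
library(ggplot2)

# Global data and mappings, inherited by both layers
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated locally in each layer
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Layer 1 is the points, layer 2 is the smooth line
same_points <- all.equal(layer_data(p_global, 1)[c("x", "y")],
                         layer_data(p_local, 1)[c("x", "y")])
same_smooth <- all.equal(layer_data(p_global, 2)[c("x", "y")],
                         layer_data(p_local, 2)[c("x", "y")])
c(points = isTRUE(same_points), smooth = isTRUE(same_smooth))
```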

6. Recreate the R code necessary to generate the following graphs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(mapping = aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = drv)) + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = drv)) + 
  geom_smooth(mapping = aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(color = "white", size = 4) + 
  geom_point(aes(color = drv))

3.7.1 Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

The default geom associated with stat_summary() is geom_pointrange().

Rewritten:

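ggplot(data = diamonds) + 
  geom_pointrange(mapping = aes(x = cut, y = depth),
    stat = "summary",
    # fun.min, fun.max, and fun replace fun.ymin, fun.ymax, and fun.y,
    # which were deprecated in ggplot2 3.3.0
    fun.min = min,
    fun.max = max,
    fun = median
  )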

2. What does geom_col() do? How is it different to geom_bar()?

geom_col() and geom_bar() both create bar charts, but with geom_bar() the height of the bar is proportional to the number of cases in each group (uses stat_count to count the number of cases at each x position), whereas with geom_col() the heights of the bars are themselves values in the data (no counting necessary - uses stat_identity).
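As a rough sketch of the difference, the bar chart that geom_bar() draws from the raw data can be reproduced with geom_col() once the counts have been computed by hand. The table() step and the object names below are illustrative choices, not the only way to count.

```r
library(ggplot2)

# geom_bar() counts the rows in each cut itself (stat_count)
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# geom_col() expects the heights to be in the data already (stat_identity)
cut_counts <- as.data.frame(table(cut = diamonds$cut))  # columns: cut, Freq
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = Freq))

# Both layers end up with the same bar heights
heights_bar <- as.numeric(layer_data(p_bar)$y)
heights_col <- as.numeric(layer_data(p_col)$y)
same_heights <- isTRUE(all.equal(heights_bar, heights_col))
same_heights
```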

3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

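geom_bar() and stat_count(), geom_histogram() and stat_bin(), geom_boxplot() and stat_boxplot(), geom_smooth() and stat_smooth(), geom_density() and stat_density(), geom_count() and stat_sum(), geom_contour() and stat_contour(), geom_quantile() and stat_quantile(), geom_violin() and stat_ydensity(), etc. Most geom and stat pairs have similar names, and each geom uses its paired stat by default.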
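One way to check which stat a geom uses by default is to inspect the layer object that the geom function returns. The sketch below relies on the stat field of the layer. The helper default_stat() is a hypothetical name introduced here, not a ggplot2 function.

```r
library(ggplot2)

# Hypothetical helper: class name of the stat attached to a layer
default_stat <- function(layer) class(layer$stat)[1]

stat_pairs <- c(
  geom_bar       = default_stat(geom_bar()),
  geom_histogram = default_stat(geom_histogram()),
  geom_boxplot   = default_stat(geom_boxplot()),
  geom_smooth    = default_stat(geom_smooth()),
  geom_density   = default_stat(geom_density()),
  geom_count     = default_stat(geom_count())
)
stat_pairs
```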

4. What variables does stat_smooth() compute? What parameters control its behaviour?

stat_smooth() computes y (predicted value), ymin (lower pointwise confidence interval around the mean), ymax (upper pointwise confidence interval around the mean), and se (standard error). Many parameters control its behavior, including method (which defines the smoothing method to use), se (which defines whether or not to display a confidence interval), and level (which defines the level of confidence interval to use).

5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

Without group = 1:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

We need to set group = 1 so that all of the data is treated as one group. Otherwise, each level of cut is its own group, every proportion is computed within that group, and every bar has a height of 1.

With group = 1:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

For the filled version, group = 1 overrides the grouping by color, so the fill is lost. One way to keep the fill is to normalize the counts by hand:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..count.. / sum(..count..)))
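The effect of the grouping on the computed proportions can be inspected directly. This sketch uses after_stat(prop), the newer spelling of ..prop.., and so assumes ggplot2 3.3.0 or later. The object names are illustrative.

```r
library(ggplot2)

# Assumes ggplot2 >= 3.3.0 for after_stat()
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

p_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

prop_grouped <- layer_data(p_grouped)$prop
prop_ungrouped <- layer_data(p_ungrouped)$prop

# Proportions of the whole dataset, computed by hand for comparison
prop_by_hand <- as.numeric(prop.table(table(diamonds$cut)))

rbind(grouped = prop_grouped, ungrouped = prop_ungrouped, by_hand = prop_by_hand)
```

With group = 1 the proportions match the hand-computed ones. Without it, each bar is its own group and gets a proportion of 1.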

3.8.1 Exercises

1. What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

Many points overlap each other (overplotting), because cty and hwy are both recorded as whole numbers. You could improve the plot by replacing geom_point() with geom_jitter(), which adds a small amount of random noise to each point.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

2. What parameters to geom_jitter() control the amount of jittering?

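width and height control the amount of jittering, in the units of the x and y axes. The noise is added in both directions, and both default to 40% of the resolution of the data.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter(width = 5, height = 10)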

3. Compare and contrast geom_jitter() with geom_count().

Both geom_jitter() and geom_count() are used to manage overplotting. While geom_jitter() adds a small amount of random noise to the location of each point, geom_count() counts the number of observations at each location, mapping count to point area.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_count()
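To get a sense of how much overplotting there is, the number of distinct (cty, hwy) locations can be compared with the number of cars. The sketch below also reads the per-location counts that geom_count() computes, through layer_data(). The object names are illustrative.

```r
library(ggplot2)

n_obs <- nrow(mpg)

# Number of distinct (cty, hwy) positions that actually appear on the plot
n_locations <- nrow(unique(mpg[c("cty", "hwy")]))

# geom_count() (stat_sum) stores the number of cars at each position in n
p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
count_data <- layer_data(p_count)

c(cars = n_obs, locations = n_locations, counted = sum(count_data$n))
```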

4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

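The default position adjustment for geom_boxplot() is "dodge2" in current versions of ggplot2 (older versions used "dodge"). Boxplots that share an x value are placed side by side instead of being drawn on top of each other.

ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
  geom_boxplot()

Looks the same:

ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
  geom_boxplot(position = "dodge2")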

Do not look the same:

ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
  geom_boxplot(position = "jitter")

ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
  geom_boxplot(position = "identity")

3.9.1 Exercises

1. Turn a stacked bar chart into a pie chart using coord_polar().

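A pie chart is a single stacked bar drawn in polar coordinates, with the bar height mapped to the angle (theta = "y"). With x = cut and the default theta = "x", coord_polar() would instead draw a Coxcomb chart.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = factor(1), fill = clarity), width = 1)

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = factor(1), fill = clarity), width = 1) +
  coord_polar(theta = "y")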

2. What does labs() do? Read the documentation.

labs() sets the plot title, subtitle, caption, axis labels, and legend titles.

3. What’s the difference between coord_quickmap() and coord_map()?

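coord_map() projects a portion of the earth onto a flat plane using a projection from the mapproj package. Projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap() is a quick approximation that does preserve straight lines, and works best for small areas close to the equator.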

nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_map()

nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

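There is a positive correlation between city and highway mpg, and cars generally get better highway mileage than city mileage, since the points sit above the reference line. geom_abline() adds that reference line to the plot (default values: intercept = 0, slope = 1), which marks where city and highway mpg would be equal. coord_fixed() ensures that one unit on the x axis is the same length as one unit on the y axis, so the reference line is drawn at a 45-degree angle and the x and y values are directly comparable.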